Before You Begin

(1) Load packages and set working directory location

  • commented lines install required packages--uncomment and run if a package is not installed
#install.packages("tidyverse")
library(tidyverse)
#install.packages("readxl")
library(readxl)
#install.packages("knitr")
library(knitr)
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
library(BiocManager)
# BiocManager::install("DESeq2")
library(DESeq2)
# BiocManager::install("WGCNA")
library(WGCNA)
# BiocManager::install("biomaRt")
library(biomaRt)
# BiocManager::install("preprocessCore")
library(preprocessCore)
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
# install.packages("LDlinkR")
library(LDlinkR)
# install.packages("httr")
library(httr)
# install.packages("coloc")
library(coloc)
# install.packages("devtools")
library(devtools)
# devtools::install_github("oliviasabik/GTExIdConverter")
# devtools::install_github("slowkow/ggrepel")
# devtools::install_github("oliviasabik/RACER")
library(GTExIdConverter)
library(RACER)
#BiocManager::install("PhenStat")
library(PhenStat)

# set the working directory to the directory where this notebook is saved
setwd("~/Desktop/starprotocol_sabik2020/")
opts_knit$set(root.dir = '~/Desktop/starprotocol_sabik2020/')

(2) Identify a GWAS study

Here we read in the summary statistics from a genome-wide association study for bone mineral density. This protocol requires that the GWAS summary statistics are available for the study of interest.

url = "http://www.gefos.org/sites/default/files/BEurope-Bmd-As-C-Gwas-SumStats.txt_0.gz"
gwas_sumstat = read_tsv(url)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   SNPID = col_character(),
##   SNP = col_character(),
##   CHR = col_double(),
##   BP = col_double(),
##   ALLELE1 = col_character(),
##   ALLELE0 = col_character(),
##   A1FREQ = col_double(),
##   INFO = col_double(),
##   BETA = col_double(),
##   SE = col_double(),
##   P = col_double(),
##   N = col_double()
## )

(3) Acquire RNA-seq data in the relevant tissue or cell type for your trait of interest

Here, we load a expression matrix, derived from murine osteoblasts (bone forming cells). Combining the results of the bone mineral density GWAS with a gene co-expression network derived from mineralizing osteoblasts will provide a platform for developing hypotheses about the role of specific mineralization-related genes that are correlated with differences in bone mineral density.

# Loading in our expression matrix GxS with G columns, representing genes and S rows, representing samples
load(url("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/expression_matrix/exp_mat.Rdata"))
# formatting exp_mat, removing strain as a column to row names
rownames(exp_mat) <- exp_mat$strain
exp_mat <- exp_mat[,-1]

(4) Pre-processing RNA-seq data for co-expression network construction

(4a) Normalizing RNA-seq data

Here, we provide examples of pre-processing steps that have already been conducted on the matrix we loaded above.

# this chunk is not run because these steps were already applied to the expression matrix we've loaded in

#apply a variance stabilizing transformation to get normalized exp_mat
varianceStabilizingTransformation(object = exp_mat)
# or apply a log transformation to get normalized exp_mat
log2(exp_mat + 1)

(4b) Removing batch effects via PEER

Here, we describe batch correction via PEER, which has already been applied to the matrix we loaded above. The PEER tool is not

# At the time of publication, PEER was not wrapped for R, and the python module was use instead, however, PEER is now available as an R package. It must be downloaded from the project GitHub page, but is in submission to CRAN. 

# A detailed wiki for how to use PEER can be found on the PEER Github Page: https://github.com/PMBio/peer/wiki/Tutorial. 

(4c) Quantile normalization

Here, we conduct quantile normalization on our expression matrix.

# Finally, we perform quantile normalization
norm_exp_mat <- normalize.quantiles(as.matrix(exp_mat))
colnames(norm_exp_mat) <- colnames(exp_mat)
rownames(norm_exp_mat) <- rownames(exp_mat)
norm_exp_mat <- as.data.frame(norm_exp_mat)
norm_exp_mat[1:5,1:5]

(5) Curating Lists of Known Disease and Phenotype Associated Genes

In this section, we load curated lists of genes known to be associated with monogenic bone disorders in humans and in abberant bone phenotypes in mouse models. These lists were curated by subject matter experts from literature and databases of bone-related phenotype screens.

# read in annotated tables of genes
url = "https://www.cell.com/cms/10.1016/j.celrep.2020.108145/attachment/ae828f98-acd0-40e7-aa41-14dd7fffefc0/mmc6.xlsx"
download.file(url = url, destfile = "./data/gene_lists/gene_lists.xlsx")
phen_genes = read_excel("./data/gene_lists/gene_lists.xlsx", sheet = 1)
monogenic_genes = read_excel("./data/gene_lists/gene_lists.xlsx", sheet = 2)

# convert genes to vector of gene identifiers
len_phen = length(phen_genes$gene_name)
phen_genes = phen_genes$gene_name[-c((len_phen-7):len_phen)]
len_mono = length(monogenic_genes$gene)
monogenic_genes = monogenic_genes$gene[-c((len_mono-10):len_mono)]

Main Protocol

(1) Construct a co-expression network

Here, we apply weighted gene coexpression network analysis to the expression matrix. This section heavily represents the content from the WGCNA tutorials created by the author's of WGCNA, found here: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/

Setting WGCNA options

options(stringsAsFactors = FALSE)

Quality Control of Expression Data

In this section, the samples in the expression matrix are clustered and plotted to identify outliers.

#Using a cluster tree to find sample outliers
sampleTree = hclust(dist(norm_exp_mat), method = "average")
# sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, 
     cex.axis = 1.5, cex.main = 2)

Power Calculation for Network Construction

Next, the power used for calculating the network is empirically chosen by calculating scale-free topology and mean connectivity.

powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(norm_exp_mat, powerVector = powers, verbose = 5, networkType = "signed")
## pickSoftThreshold: will use block size 1529.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 1529 of 29255
## Warning: executing %dopar% sequentially: no parallel backend registered
##    ..working on genes 1530 through 3058 of 29255
##    ..working on genes 3059 through 4587 of 29255
##    ..working on genes 4588 through 6116 of 29255
##    ..working on genes 6117 through 7645 of 29255
##    ..working on genes 7646 through 9174 of 29255
##    ..working on genes 9175 through 10703 of 29255
##    ..working on genes 10704 through 12232 of 29255
##    ..working on genes 12233 through 13761 of 29255
##    ..working on genes 13762 through 15290 of 29255
##    ..working on genes 15291 through 16819 of 29255
##    ..working on genes 16820 through 18348 of 29255
##    ..working on genes 18349 through 19877 of 29255
##    ..working on genes 19878 through 21406 of 29255
##    ..working on genes 21407 through 22935 of 29255
##    ..working on genes 22936 through 24464 of 29255
##    ..working on genes 24465 through 25993 of 29255
##    ..working on genes 25994 through 27522 of 29255
##    ..working on genes 27523 through 29051 of 29255
##    ..working on genes 29052 through 29255 of 29255
##    Power SFT.R.sq  slope truncated.R.sq  mean.k. median.k.  max.k.
## 1      1   0.0985 -74.30          0.978 14600.00  14600.00 14900.0
## 2      2   0.3610 -59.50          0.873  7540.00   7530.00  7900.0
## 3      3   0.6250 -41.80          0.890  3990.00   3980.00  4390.0
## 4      4   0.6770 -27.10          0.901  2170.00   2160.00  2550.0
## 5      5   0.7300 -19.30          0.925  1210.00   1200.00  1540.0
## 6      6   0.7800 -14.40          0.945   687.00    678.00   967.0
## 7      7   0.8490 -10.60          0.938   399.00    391.00   632.0
## 8      8   0.9160  -8.79          0.954   237.00    230.00   448.0
## 9      9   0.9600  -7.29          0.968   143.00    138.00   333.0
## 10    10   0.9810  -6.03          0.977    88.20     83.60   258.0
## 11    12   0.9920  -4.38          0.994    35.30     32.30   174.0
## 12    14   0.9910  -3.37          0.994    15.20     13.10   131.0
## 13    16   0.9900  -2.75          0.989     7.06      5.61   105.0
## 14    18   0.9770  -2.36          0.970     3.53      2.50    87.2
## 15    20   0.9710  -2.08          0.964     1.91      1.16    74.1
# Plot the results:
# sizeGrWindow(9, 5)
# par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence")); text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");abline(h=0.9,col="red")

# this line corresponds to using an R^2 cut-off of h
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"));text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

#### power = 8
#### lowest power to provide a scale-free topology fit of at least 0.9 
#### Low mean connectivity score

Calculating the Network

Using the lower power that results in a scale-free topology fit of at least 0.9, and thus a low mean connectivity score, the network is calculated.

net = blockwiseModules(norm_exp_mat, power = 8,
                       TOMType = "signed", minModuleSize = 20,
                       networkType = "signed",
                       reassignThreshold = 0, mergeCutHeight = 0.15,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "./data/network/norm_exp_mat_TOM",
                       verbose = 3)
##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##  ....pre-clustering genes to determine blocks..
##    Projective K-means:
##    ..k-means clustering..
##    ..merging smaller clusters...
## Block sizes:
## gBlocks
##    1    2    3    4    5    6 
## 4996 4980 4964 4880 4739 4696 
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file ./data/network/norm_exp_mat_TOM-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 280 genes from module 1 because their KME is too low.
##      ..removing 318 genes from module 2 because their KME is too low.
##      ..removing 77 genes from module 3 because their KME is too low.
##      ..removing 499 genes from module 4 because their KME is too low.
##      ..removing 168 genes from module 5 because their KME is too low.
##      ..removing 121 genes from module 6 because their KME is too low.
##      ..removing 38 genes from module 7 because their KME is too low.
##      ..removing 2 genes from module 8 because their KME is too low.
##      ..removing 2 genes from module 9 because their KME is too low.
##  ..Working on block 2 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 2 into file ./data/network/norm_exp_mat_TOM-block.2.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 530 genes from module 1 because their KME is too low.
##      ..removing 654 genes from module 2 because their KME is too low.
##      ..removing 96 genes from module 3 because their KME is too low.
##      ..removing 188 genes from module 4 because their KME is too low.
##      ..removing 21 genes from module 5 because their KME is too low.
##  ..Working on block 3 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 3 into file ./data/network/norm_exp_mat_TOM-block.3.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 1156 genes from module 1 because their KME is too low.
##      ..removing 1071 genes from module 2 because their KME is too low.
##  ..Working on block 4 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 4 into file ./data/network/norm_exp_mat_TOM-block.4.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 562 genes from module 1 because their KME is too low.
##      ..removing 360 genes from module 2 because their KME is too low.
##      ..removing 264 genes from module 3 because their KME is too low.
##      ..removing 215 genes from module 4 because their KME is too low.
##      ..removing 151 genes from module 5 because their KME is too low.
##      ..removing 79 genes from module 6 because their KME is too low.
##      ..removing 22 genes from module 7 because their KME is too low.
##      ..removing 6 genes from module 8 because their KME is too low.
##      ..removing 3 genes from module 9 because their KME is too low.
##  ..Working on block 5 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 5 into file ./data/network/norm_exp_mat_TOM-block.5.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 657 genes from module 1 because their KME is too low.
##      ..removing 269 genes from module 2 because their KME is too low.
##      ..removing 119 genes from module 3 because their KME is too low.
##      ..removing 130 genes from module 4 because their KME is too low.
##      ..removing 126 genes from module 5 because their KME is too low.
##      ..removing 64 genes from module 6 because their KME is too low.
##      ..removing 53 genes from module 7 because their KME is too low.
##      ..removing 36 genes from module 8 because their KME is too low.
##      ..removing 18 genes from module 9 because their KME is too low.
##      ..removing 33 genes from module 10 because their KME is too low.
##      ..removing 35 genes from module 11 because their KME is too low.
##      ..removing 40 genes from module 12 because their KME is too low.
##      ..removing 13 genes from module 13 because their KME is too low.
##      ..removing 26 genes from module 14 because their KME is too low.
##      ..removing 6 genes from module 15 because their KME is too low.
##      ..removing 5 genes from module 16 because their KME is too low.
##      ..removing 5 genes from module 17 because their KME is too low.
##      ..removing 3 genes from module 18 because their KME is too low.
##      ..removing 6 genes from module 19 because their KME is too low.
##      ..removing 3 genes from module 20 because their KME is too low.
##      ..removing 2 genes from module 21 because their KME is too low.
##  ..Working on block 6 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 6 into file ./data/network/norm_exp_mat_TOM-block.6.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 1087 genes from module 1 because their KME is too low.
##      ..removing 681 genes from module 2 because their KME is too low.
##      ..removing 158 genes from module 3 because their KME is too low.
##      ..removing 126 genes from module 4 because their KME is too low.
##      ..removing 58 genes from module 5 because their KME is too low.
##      ..removing 10 genes from module 6 because their KME is too low.
##      ..removing 7 genes from module 7 because their KME is too low.
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.15
##        Calculating new MEs...
table(net$colors)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## 10659  1463  1274  1077   894   806   788   751   708   690   661   635   631 
##    13    14    15    16    17    18    19    20    21    22    23    24    25 
##   527   521   488   469   438   432   406   372   297   287   282   276   270 
##    26    27    28    29    30    31    32    33    34    35    36    37    38 
##   245   231   218   213   188   143   132   124   120   116    97    97    90 
##    39    40    41    42    43    44    45    46    47    48    49    50    51 
##    89    83    80    77    76    60    60    58    56    55    54    54    48 
##    52    53    54    55    56    57    58 
##    47    47    45    44    40    37    29

Saving the Network

The components of the network required for downstream analyses are saved so the network does not need to be calculated multiple times.

moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
MEs = net$MEs;
dim(MEs)
## [1] 42 59
geneTree = net$dendrograms[[1]];
save(sft, MEs, moduleLabels, moduleColors, geneTree,
     file = "./data/network/network.RData")

Loading the Network

In order to return to this script and run downstream analyses without recalculating the network, the necessary artifacts can be loaded using this code.

load("./data/network/network.RData")

Annotating the Network

In order to interpret the network it is useful to label the genes in the network with identifiers, e.g. Ensembl gene IDs, entrez IDs, etc. Using the biomaRt package, mappings from those identifiers to gene symbols can be acquired. These steps will vary based on the organism you use and the gene identifiers used in the list of phenotype and disease associated genes generated above.

# establish the database that matches your organism
ensembl = useMart("ensembl")
mm19 = useDataset(mart = ensembl, dataset = "mmusculus_gene_ensembl")

# These functions may be used to determine the attributes and filters to apply in the database query
# listAttributes(mm19)
# listFilters(mm19)

# Next, query the BioMart database for annotations for the genes of interest and merge the output of the query with the network results
m = getBM(attributes = c("ensembl_transcript_id", "external_gene_name", "description", "chromosome_name", "start_position", "end_position"),
          filters = c("ensembl_transcript_id"),
          values = colnames(norm_exp_mat),
          mart = mm19)

gene_MEs <- rbind(colnames(norm_exp_mat), moduleColors, moduleLabels)
gene_MEs <- t(gene_MEs)
colnames(gene_MEs)[1:3] <- c("ensembl_transcript_id", "mod_color", "mod_number")
gene_MEs <- as.data.frame(gene_MEs)
annotated_mods = merge(gene_MEs, m, by = "ensembl_transcript_id")

# now you can filter annotated_mods to see the genes in each co-expression module
annotated_mods %>%
  filter(mod_number == 1)

(2) Gene Ontology Analysis

In this section, we describe the process of identifying gene ontology processes that are enriched in each coexpression module, and look at an example results table.

While tools exist for performing gene ontology analysis in R, our tool of preference for this project is ToppFun, of the ToppGene suite: https://toppgene.cchmc.org/enrichment.jsp, which is a web interface tool. ToppGene is unique in that it not only searches for enriched gene ontology categories and pathways for your genes of interest, but also human and mouse phenotypes, publications and published coexpression data sets, gene families, microRNAs, drugs, and diseases.

ToppFun reports back the significance of each enrichment with an assortment of multiple testing correction methods, the number of hits for that category from your query ("Hit Count in Query List"), and the total number of genes in the category ("Hit Count in Genome").

# You can extract the gene names for a module by writing the module out to a .csv file which can be fed into external tools, like ToppGene
annotated_mods %>%
  filter(mod_number == 1) %>%
  write_csv(., "./data/network/mod_1.csv")

# From ToppFun, you can click the "Download All" button from your results page, and get all of the results, so you can save them, browse them in R, and compare different modules in R
url = "https://www.cell.com/cms/10.1016/j.celrep.2020.108145/attachment/60cdba8a-aa04-4c22-8ca4-4f8f0e6173dd/mmc3.xlsx"
download.file(url = url, destfile = "./data/gene_ontology/toppfun_res.xlsx")
toppfun_1 = read_excel("./data/gene_ontology/toppfun_res.xlsx", sheet = 1, skip = 1)
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in J2086 / R2086C10: got 'NA'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in J2211 / R2211C10: got 'NA'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3600 / R3600C3: got 'Color representing the co-expression
## module used for GO analysis'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3601 / R3601C3: got 'Number of genes in each co-expression
## module'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3602 / R3602C3: got 'Category of query GO term'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3603 / R3603C3: got 'GO ID'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3604 / R3604C3: got 'Query GO term'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3605 / R3605C3: got 'P-value of enrichment'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3606 / R3606C3: got 'Bonferroni corrected p-value'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3607 / R3607C3: got 'Benjamini-Hochberg corrected p-value'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3608 / R3608C3: got 'Benjamini–Yekutieli corrected p-
## value'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3609 / R3609C3: got 'Number of module genes in GO category
## list'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3610 / R3610C3: got 'Number of gene in the GO cateogry'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in C3611 / R3611C3: got 'Genes from query list overlapping
## hits'
toppfun_1

(3) Creating GWAS gene list

Next, a list of genes implicated by the GWAS we loaded above will be generated. In the case of the GWAS we used, fine mapping was performed to identify a subset of significantly associated SNPs that are indepedent associations and lead SNPs, so here, we'll read in the lead SNP file and use it to subset the full summary statistics for the study.

Generating GWAS regions (SNPs with LD > 0.7 for GWAS SNPs)

## Reading in lead SNPs
url = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5621629/bin/NIHMS73769-supplement-Supplementary_Table_2.xlsx"
download.file(url = url, destfile = "./data/gwas/gwas_bmd_lead_snps.xlsx")
lead_snps = read_excel("./data/gwas/gwas_bmd_lead_snps.xlsx", sheet = 1, skip = 3)
bmd_snps <- subset(gwas_sumstat, gwas_sumstat$SNP %in% lead_snps$RSID)

# Some of the variants are not SNPs, but indels. LDLink recognizes them by a different naming convention, so we have to modify their identifiers for LDLink to find their proxies
bmd_snps = bmd_snps %>%
  mutate(SNP = word(bmd_snps$SNP,1,sep = "_"))

bmd_snps$SNP[grep(":", bmd_snps$SNP)]
## [1] "1:22483649"  "10:29087203" "17:27961561" "22:19677948"
bmd_snps$SNP[6] = paste0("chr", bmd_snps$SNP[6])
bmd_snps$SNP[163] = paste0("chr", bmd_snps$SNP[163])
bmd_snps$SNP[260] = paste0("chr", bmd_snps$SNP[260])
bmd_snps$SNP[301] = paste0("chr", bmd_snps$SNP[301])

# Next we batch query LDLink to get all of the proxy SNPs
LDproxy_batch(snp = bmd_snps$SNP, pop = "EUR", r2d = "r2", token = "c0f613f149ab", append = TRUE)
##   error: rs191147097 is monoallelic in the EUR population.,
##   error: rs149333699 is monoallelic in the EUR population.,
##   error: rs184953495 is monoallelic in the EUR population.,
##   error: rs10668066 is not a biallelic variant.,
##   error: rs143581991 is not in 1000G reference panel.,
##   error: rs746652558 is not in 1000G reference panel.,
##   error: rs12806687 is not in 1000G reference panel.,
##   error: rs72186592 is not in dbSNP build 151.,
##   error: rs202234992 is not in 1000G reference panel.,
# some SNPs will return an error in the LD search. We will stil annotate the nearest up and downstream genes from these SNPs, but their window will just be their coordinate location
no_ld_snps = c("rs191147097", "rs149333699", "rs184953495", "rs10668066",
               "rs143581991", "chr10:29087203", "rs12806687", "rs72186592", "rs202234992")

# we format these so we can combine them with the LD 0.7 regions below
no_ld_regions = bmd_snps %>%
  filter(SNP %in% no_ld_snps) %>%
  dplyr::rename(query_snp = SNP) %>%
  group_by(query_snp) %>%
  summarize(min = min(BP), max = max(BP), chr = paste0("chr",max(CHR)))

# We read back in the results of the LDLink proxy query
proxies = read_tsv("./combined_query_snp_list.txt", skip = 1, col_names = FALSE)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_double(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_character(),
##   X11 = col_character(),
##   X12 = col_character()
## )
colnames(proxies) = c("index", "query_snp", "rs_id", "coord", "alleles", "maf", "distance", "d_prime", "r2", "correlated_alleles", "regulome_db", "function")

# now we filter for proxies with R2 greater than 0.7, and for each query_snp we will create an entry in our regions table with the max and min distance from the 
ld_regions = proxies %>%
  filter(r2 >= 0.7) %>%
  separate(col = coord, sep = ":", into = c("chr", "coord")) %>%
  group_by(query_snp) %>%
  summarize(chr = max(chr), min = as.numeric(min(coord)), max = as.numeric(max(coord)))

# then we combine both sets to get one set of regions of LD >= 0.7 for each GWAS lead SNP
ld07_gwas_regions = bind_rows(no_ld_regions, ld_regions) %>%
  dplyr::select(chr, start = min, end = max, query_snp)

# make sure none of the regions have a negative width
for(i in 1:nrow(ld07_gwas_regions)){
  if((ld07_gwas_regions[i,3]-ld07_gwas_regions[i,2]) < 0){
    print(paste0("swap row #", i))
    start = ld07_gwas_regions[i,3]
    end = ld07_gwas_regions[i,2]
    ld07_gwas_regions[i,3] = end
    ld07_gwas_regions[i,2] = start
  }
}
## [1] "swap row #46"

Intersection GWAS regions with reference gene file

Next, we want to find the intersection between these regions and the full human geneset. Our gene set was generated from the hg19 reference genome using the UCSC gene tables. It is important to match the coordinates of the genome used in the GWAS study to the gene set

## Reading in UCSC RefSeq gene table ##
url = "https://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz"
genes = read_tsv(url, col_names = FALSE)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_double(),
##   X6 = col_double(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_number(),
##   X11 = col_number(),
##   X12 = col_double(),
##   X13 = col_character(),
##   X14 = col_character(),
##   X15 = col_character(),
##   X16 = col_character()
## )
colnames(genes) = c("bins", "refseq_id", "chr", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "ExonStarts", "ExonEnds", "score", "name", "cdsStartStat", "cdsEndStat", "exonFrames")
genes = genes[,c(3,5,6,2,13)]
genes = genes %>%
  filter(str_detect(refseq_id, "NM_"))

# check that the column types are correct and chromosome notation matches
head(genes)
head(ld07_gwas_regions)
# Next, we're going to use the Genomic Ranges package to annotate our ranges with genes
gr_genes <- GRanges(genes)
gr_gwas <- GRanges(ld07_gwas_regions)

# We find the genes that overlap the ranges from the GWAS study
overlaps <- GenomicRanges::findOverlaps(gr_genes, gr_gwas)
overlaps_info = cbind(ld07_gwas_regions[overlaps@to,], genes[overlaps@from,]$name)
overlaps_info = overlaps_info[,c(1,4,5)]
overlaps_info$type = "overlaps"
overlaps_info = overlaps_info[!duplicated(overlaps_info), ]
colnames(overlaps_info) = c("chromosome", "gwas_snp", "gene_name", "type")

# We find the nearest genes preceding the ranges from the GWAS study
nearest_precede <- GenomicRanges::precede(gr_gwas, gr_genes)
precede_info = cbind(ld07_gwas_regions, genes[nearest_precede,])
precede_info = precede_info[,c(1,4,9)]
precede_info$type = "precede"
precede_info = precede_info[!duplicated(precede_info), ]
colnames(precede_info) = c("chromosome", "gwas_snp", "gene_name", "type")
precede_info = precede_info[-which(precede_info$morris_snp %in% overlaps_info$gwas_snp),]

# We find the nearest genes following the ranges from the GWAS study
nearest_follow <- GenomicRanges::follow(gr_gwas, gr_genes)
follow_info = cbind(ld07_gwas_regions, genes[nearest_follow,])
follow_info = follow_info[,c(1,4,9)]
follow_info$type = "follow"
follow_info = follow_info[!duplicated(follow_info), ]
colnames(follow_info) = c("chromosome", "gwas_snp", "gene_name", "type")
follow_info = follow_info[-which(follow_info$morris_snp %in% overlaps_info$gwas_snp),]

# Finally, we combine these results into one table
gwas_region_genes = rbind(overlaps_info, precede_info, follow_info)
gwas_region_genes

Identify mouse homologs for GWAS genes

Now, we now have a list of human genes that overlap the regions as defined by LD from our study of interest, however we may also need the equivalent list in mouse. We can use a homology map to generate a list of mouse homologs for the human gene list from MGI (http://www.informatics.jax.org/faq/ORTH_dload.shtml)

## Reading in homology map ##
url = "http://www.informatics.jax.org/downloads/reports/HOM_MouseHumanSequence.rpt"
homology <- read_tsv(url)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `HomoloGene ID` = col_double(),
##   `Common Organism Name` = col_character(),
##   `NCBI Taxon ID` = col_double(),
##   Symbol = col_character(),
##   `EntrezGene ID` = col_double(),
##   `Mouse MGI ID` = col_character(),
##   `HGNC ID` = col_character(),
##   `OMIM Gene ID` = col_character(),
##   `Genetic Location` = col_character(),
##   `Genomic Coordinates (mouse: , human: )` = col_character(),
##   `Nucleotide RefSeq IDs` = col_character(),
##   `Protein RefSeq IDs` = col_character(),
##   `SWISS_PROT IDs` = col_character()
## )
homology
gwas_gene_hom <- homology %>%
  filter(Symbol %in% gwas_region_genes$gene_name)

gwas_mouse_hom <- homology %>%
 filter(`HomoloGene ID` %in% gwas_gene_hom$`HomoloGene ID`) %>%
 filter(`NCBI Taxon ID` == 10090)

gwas_mouse_genes = gwas_mouse_hom$Symbol 

(4) Module Enrichment

In this section, we conduct Fisher's Exact tests for module enrichment for GWAS genes, disease genes, and phenotype associated genes and generate tables and plots of the results.

Identify modules enriched for GWAS genes

# create list of all modules
colors = unique(moduleColors)

# create object for each module and compute enrichment of overlap between module and GWAS gene list
for (i in 1:length(colors)){
  color = colors[i]
  matrix_name = paste("mod_",color,"_gene_exp", sep = "")
  assign(paste0("mod_",color,"_gene_exp"), subset(gene_MEs, gene_MEs$mod_color == color))
  assign(paste0("mod_",color,"_gene_exp"), as.data.frame(get(paste0("mod_",color,"_gene_exp"))))
  assign(paste0("mod_",color,"_gene_ids"), rownames(get(paste0("mod_",color,"_gene_exp"))))
  assign(paste0("mod_",color,"_trx_info"), annotated_mods %>% filter(mod_color == color))
  d = as.data.frame(get(paste0("mod_",color,"_trx_info")))
  x = unique(d$external_gene_name) %in% gwas_mouse_genes
  x = sum(x == TRUE)
  a <- x
  b <- (417 - a)
  c <- (dim(d)[1] - a)
  e <- (29255 - c - b - a)
  assign(paste0("ft_mod_",color), fisher.test(matrix(c(a,b,c,e),2,2,byrow=TRUE), 
                      alternative='greater'))
}

# format results in table
gwas_enrichment_results = as.data.frame(matrix(nrow=59,ncol=3))
colnames(gwas_enrichment_results) <- c('module_color','p_value','odds_ratio')
for (i in 1:length(colors)){
  color = colors[i]
  ft = get(paste0("ft_mod_",color))
  gwas_enrichment_results[i,1] = color
  gwas_enrichment_results[i,2] = ft$p.value
  gwas_enrichment_results[i,3] = ft$estimate
}

# Computed FDR values, -log10(p-values), and view results
gwas_enrichment_results$p.adj = p.adjust(gwas_enrichment_results$p_value, method = "fdr", n = length(gwas_enrichment_results$p_value))
gwas_enrichment_results$neg_log10 = -log10(gwas_enrichment_results$p.adj)
gwas_enrichment_results %>% arrange(p.adj)

Plot the module enrichment and significance for GWAS genes

gwas_enrichment_results %>%
  ggplot(aes(x = odds_ratio, y = neg_log10)) +
  geom_point(aes(color = module_color), size = 3) + 
  geom_point(color = "black", size = 3, pch = 21) + theme_bw() +
  geom_hline(yintercept = -log10(0.05), color = "red") +
  scale_colour_identity()

Browse enriched modules

The annotated modules can be retrieved so the genes in the enriched modules can be browsed.

# Filtering annotated modules to browse one module
annotated_mods %>%
  filter(mod_color == "tan")

Identify modules enriched for genes associated with monogenic bone disorders

In addition to identifying modules enriched for GWAS genes, modules enriched for genes known to cause related monogenic diseases can be identified

# First, the human monogenic gene IDs are converted to mouse IDs
mono_gene_hom <- homology %>%
  filter(Symbol %in% monogenic_genes)
mono_mouse_hom <- homology %>%
 filter(`HomoloGene ID` %in% mono_gene_hom$`HomoloGene ID`) %>%
 filter(`NCBI Taxon ID` == 10090)
mono_mouse_genes = mono_mouse_hom$Symbol 

# Then the loop is run to identify enriched modules
for (i in 1:length(colors)){
  color = colors[i]
  d = as.data.frame(get(paste0("mod_",color,"_trx_info")))
  x = unique(d$external_gene_name) %in% mono_mouse_genes
  x = sum(x == TRUE)
  a <- x
  b <- (40 - a)
  c <- (dim(d)[1] - a)
  e <- (29255 - a - b -c)
  assign(paste0("ft_mod_",color), fisher.test(matrix(c(a,b,c,e),2,2,byrow=TRUE), 
                      alternative='greater'))
}

# the results table is formatted
mono_results = as.data.frame(matrix(nrow=59,ncol=3))
colnames(mono_results) <- c('module_color','p_value','odds_ratio')
for (i in 1:length(colors)){
  color = colors[i]
  enrichment = get(paste0("ft_mod_",color))
  mono_results[i,1] = color
  mono_results[i,2] = enrichment$p.value
  mono_results[i,3] = enrichment$estimate
}

mono_results$p.adj = p.adjust(mono_results$p_value, method = "fdr", n = length(mono_results$p_value))
mono_results$neg_log10 = -log10(mono_results$p.adj)
filter(mono_results, mono_results$p.adj < 0.05) %>%
  arrange(desc(odds_ratio))

Plot the module enrichment and significance for monogenic disease genes

mono_results %>%
  ggplot(aes(x = odds_ratio, y = neg_log10)) +
  geom_point(aes(color = module_color), size = 3) + 
  geom_point(color = "black", size = 3, pch = 21) + theme_bw() +
  geom_hline(yintercept = -log10(0.05), color = "red") +
  scale_colour_identity()

Identify modules enriched for genes associated with bone phenotypes in model organisms

Finally, modules enriched for genes known to cause related phenotypes in model organisms are identified

# Then the loop is run to identify modules enriched for phenotype associated genes
for (i in 1:length(colors)){
  color = colors[i]
  d = as.data.frame(get(paste0("mod_",color,"_trx_info")))
  x = unique(d$external_gene_name) %in% phen_genes
  x = sum(x == TRUE)
  a <- x
  b <- (1088 - a)
  c <- (dim(d)[1] - a)
  e <- (29255 - a - b -c)
  assign(paste0("ft_mod_",color), fisher.test(matrix(c(a,b,c,e),2,2,byrow=TRUE), 
                      alternative='greater'))
}

# the results table is formatted
phen_results = as.data.frame(matrix(nrow=59,ncol=3))
colnames(phen_results) <- c('module_color','p_value','odds_ratio')
for (i in 1:length(colors)){
  color = colors[i]
  enrichment = get(paste0("ft_mod_",color))
  phen_results[i,1] = color
  phen_results[i,2] = enrichment$p.value
  phen_results[i,3] = enrichment$estimate
}

phen_results = arrange(phen_results, desc(odds_ratio))
phen_results$p.adj = p.adjust(phen_results$p_value, method = "fdr", n = length(phen_results$p_value))
phen_results$neg_log10 = -log10(phen_results$p.adj)
filter(phen_results, phen_results$p.adj < 0.05)

Plot the module enrichment and significance for genes associated with bone phenotype from model systems

phen_results %>%
  ggplot(aes(x = odds_ratio, y = neg_log10)) +
  geom_point(aes(color = module_color), size = 3) + 
  geom_point(color = "black", size = 3, pch = 21) + theme_bw() +
  geom_hline(yintercept = -log10(0.05), color = "red") +
  scale_colour_identity()

(5) LD Score Regression

LD score regression (ldsc package) is used to calculated the partitioned heritability attributed to SNPs surrounding the genes that compose each module. The github repo and the wiki for the ldsc package has detailed tutorials for how to carry out this analysis. It is not wrapped for R, but can be run on the command line using python. The general steps taken are outlined here:

# (1) First, we need to format the GWAS summary statistics for input into the lsdc algorithm

## munge sumstats, returns a file called out.sumstats.gz
# ./munge_sumstats.py \
# --out BMD \
# --merge-alleles w_hm3.snplist \
# --a1-inc  \
# --sumstats bmd_gwas_sumstats.txt

# (2) Generate genesets for each module for all chromosomes; this requires a list of gene identifiers for the genes in each module, a file indicated the coordinates for each identifier, and the plink .bim files of pre-computed LD scores. In this application we use the 1000 genomes plink files, the Genesets are listed as Ensembl gene IDs, and the gene coordinate file "ENSG_coord.txt", provided with the ldsc package, and we use a windowsize of 100,000, as recommended by the authors of the application

## make annotations for each module and each chromosome
# python ../../src/ldsc/make_annot.py --gene-set-file violet_module_human_gene_ids.Geneset --gene-coord-file ENSG_coord.txt --windowsize 100000 --bimfile ./1000G_plinkfiles/1000G.mac5eur.1.bim --annot-file ./violet_annot/violet_module.1.annot.gz

# (3) Finally, using all of these annotations, run ldsc using the processed summary statistics, the base annotation paths for the modules' annotations, the SNP weights and frequencies for the European 1000 Genomes data that are provided with the ldsc package, the overlap annotations was used because transcripts were used to generate the co-expression network, so the gene sets are non-disjoint, and finally, a base name for the output is provided.

## compute LDSC, outputs out.log and out.results 
# python ldsc.py 
#   --h2 BMD.sumstats.gz\
#   --ref-ld-chr antiquewhite4_module.,
#     bisque4_module.,black_module.,
#     blue_module., ...etc.\ 
#   --w-ld-chr ./weights_hm3_no_hla/weights.
#   --overlap-annot\
#   --frqfile-chr 1000G.mac5eur.\
#   --out BMD_all_modules_compare

### The output of the last step includes a log, which echos the command used to run the regression and provides a summary of the results, and a results table, which reports the proportion of SNPs, the proportion of heritability, the standard error fo the proportion of heritability, the heritability, the standard error of the heritability, and a p-value indicating the statistical significance of the enrichment

#ldsc_log = read_lines("./data/ld_score_regression/BMD_all_modules_compare.log")
#ldsc_results = read_tsv("./data/ld_score_regression/BMD_all_modules_compare.results")

# The results table has a category for each annotation provided, but they have generic labels. The log keeps track of the annotation input, so we can get the names from the log, because the order of the annotation matters

# Here is an example of table of results from an ldsc run arranged by p-value
url = "https://www.cell.com/cms/10.1016/j.celrep.2020.108145/attachment/921fc33d-d674-4488-951f-295320b24cea/mmc5.xlsx"
download.file(url = url, destfile = "./data/ld_score_regression/ld_score_reg_res.xlsx")
ldsc_res = as.data.frame(read_excel("./data/ld_score_regression/ld_score_reg_res.xlsx", sheet = 3))

# A p-value adjustment is also applied to correct for testing across all modules
ldsc_res$padj = p.adjust(ldsc_res$Enrichment_p, method = "fdr", n = length(ldsc_res$Enrichment_p))
ldsc_res %>%
  dplyr::select(module, Enrichment, padj) %>%
  filter(padj < 0.05)

(6) Colocalization Analysis

In this section, the coloc package is used to perform genetic colocalization analysis to determine whether eQTL for genes of interest from the core module identified above share common causal variants with QTL for the trait of interest.

Computing colocalization of GTEX eQTL for one gene in all tissues and trait QTL

# For this purpose, the GWAS summary statistics we loaded above are sufficient, however, we must reformat the gwas data for input into coloc
gwas_coloc = gwas_sumstat[c(2,5,6,11,9,7)]
gwas_coloc$MAF = ifelse(gwas_coloc$A1FREQ > 0.5, (1- gwas_coloc$A1FREQ), (gwas_coloc$A1FREQ))

# In addition to the summary statistics for the GWAS, the eQTL for the genes of interest will also need to be read in. These have been filtered from the full set of associations in the GTEx v7 eQTL studies using awk in the command line, e.g. > awk -F "\t" '$1 ~ /ENSG###/ {print}' .txt | awk -F "\t" '{ if(($3 >= lower_coord_limit) && ($3 <= upper_coord_limit)) { print } }' > output_file.txt. Do this for each tissue in GTEx. The tarball you need to download from GTEx is very large, so you will likely want to execute this on a server. The file can be downloaded with this command >wget https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL_all_associations.tar.gz

# We will also need to know about the number of samples used in the eQTL analysis for each tissue, so the key with this data need to be read in
tis = read_tsv("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/eqtl_data/tissue_key.txt")

# with these files in hand we can loop over all of the tissues and test colocalization in each tissue
gene_coloc_results = data.frame(matrix(NA, nrow = length(tis$Tissue), ncol = 8)) #empty table to fill with results for each tissue
for (i in 1:length(tis$Tissue)){
  tissue = as.character(tis$Tissue[i])
  z = read_tsv(url(paste0("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/eqtl_data/b4galnt3_snps/", tissue, "_b4galnt3_ebmd_snps.txt")), col_names = FALSE) # this will read in the eqtl file
  # formatting
  cols <- c("X2", "X3", "X4", "X5", "X6")
  z$variant_id <- do.call(paste, c(z[cols], sep="_"))
  z = z[,c(1,14,7,8,9,10,11,12,13)]
  colnames(z) = c("gene_id", "variant_id", "tss_distance", "ma_samples", "ma_count", "maf",
                     "pval_nominal", "slope", "slope_se")
  # add in RS_IDs 
  gene_snp_ids = GTExIdConvert(z$variant_id)
  z = merge(z, gene_snp_ids, by = "variant_id")
  tissue_n = as.numeric(tis[which(tis$Tissue == tissue),2])
  # make coloc objects
  gene.coloc = list(pvalues=as.numeric(z$pval_nominal),N=as.numeric(tissue_n),type='quant',
                    snp=as.character(z$rs_id), MAF=as.numeric(z$maf))
  gwas.coloc = list(pvalues=as.numeric(gwas_coloc$P),N=142487,type='quant',
                    snp=as.character(gwas_coloc$SNP), MAF=as.numeric(gwas_coloc$MAF))
  
  # run coloc
  coloc_x = coloc.abf(gene.coloc,gwas.coloc)
  # write out coloc results
  gene_coloc_results[i,] = c(tissue,z$gene_id[1],coloc_x$summary[1],coloc_x$summary[2],coloc_x$summary[3],
                                 coloc_x$summary[4],coloc_x$summary[5],coloc_x$summary[6])
}
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.12e-12  2.56e-06  8.28e-07  1.00e+00  2.97e-06 
## [1] "PP abf for shared variant: 0.000297%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  4.78e-14  2.56e-06  1.86e-08  1.00e+00  6.45e-08 
## [1] "PP abf for shared variant: 6.45e-06%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.21e-06  2.24e-07  8.63e-01  8.74e-02  4.99e-02 
## [1] "PP abf for shared variant: 4.99%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.28e-13  1.99e-08  4.97e-08  6.78e-03  9.93e-01 
## [1] "PP abf for shared variant: 99.3%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  5.85e-58  1.90e-08  2.28e-52  6.41e-03  9.94e-01 
## [1] "PP abf for shared variant: 99.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.19e-06  2.56e-07  8.55e-01  9.97e-02  4.49e-02 
## [1] "PP abf for shared variant: 4.49%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.31e-06  1.50e-07  9.00e-01  5.83e-02  4.15e-02 
## [1] "PP abf for shared variant: 4.15%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.91e-06  2.72e-07  7.44e-01  1.06e-01  1.50e-01 
## [1] "PP abf for shared variant: 15%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.16e-06  8.44e-07  4.53e-01  3.29e-01  2.18e-01 
## [1] "PP abf for shared variant: 21.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.20e-06  2.25e-07  8.57e-01  8.75e-02  5.54e-02 
## [1] "PP abf for shared variant: 5.54%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.04e-06  3.96e-07  7.94e-01  1.54e-01  5.15e-02 
## [1] "PP abf for shared variant: 5.15%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.76e-06  1.48e-07  6.85e-01  5.75e-02  2.58e-01 
## [1] "PP abf for shared variant: 25.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.11e-06  3.18e-07  8.23e-01  1.24e-01  5.27e-02 
## [1] "PP abf for shared variant: 5.27%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.07e-06  2.07e-07  8.06e-01  8.07e-02  1.14e-01 
## [1] "PP abf for shared variant: 11.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.02e-06  3.21e-07  7.87e-01  1.25e-01  8.78e-02 
## [1] "PP abf for shared variant: 8.78%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.27e-06  1.75e-07  8.84e-01  6.81e-02  4.82e-02 
## [1] "PP abf for shared variant: 4.82%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.12e-06  2.17e-07  8.28e-01  8.47e-02  8.70e-02 
## [1] "PP abf for shared variant: 8.7%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.27e-06  1.68e-07  8.85e-01  6.53e-02  4.99e-02 
## [1] "PP abf for shared variant: 4.99%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.06e-06  1.79e-07  8.02e-01  6.95e-02  1.28e-01 
## [1] "PP abf for shared variant: 12.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  4.48e-16  2.56e-06  1.74e-10  1.00e+00  1.00e-08 
## [1] "PP abf for shared variant: 1e-06%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.20e-06  1.86e-07  8.59e-01  7.23e-02  6.83e-02 
## [1] "PP abf for shared variant: 6.83%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.21e-06  2.28e-07  8.60e-01  8.90e-02  5.06e-02 
## [1] "PP abf for shared variant: 5.06%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.27e-06  1.57e-07  4.95e-01  6.08e-02  4.44e-01 
## [1] "PP abf for shared variant: 44.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.07e-06  2.74e-07  8.05e-01  1.07e-01  8.83e-02 
## [1] "PP abf for shared variant: 8.83%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  3.14e-07  2.22e-06  1.22e-01  8.65e-01  1.29e-02 
## [1] "PP abf for shared variant: 1.29%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.82e-06  4.32e-07  7.10e-01  1.68e-01  1.21e-01 
## [1] "PP abf for shared variant: 12.1%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.66e-06  7.85e-07  6.48e-01  3.06e-01  4.57e-02 
## [1] "PP abf for shared variant: 4.57%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.89e-06  1.81e-07  7.36e-01  7.02e-02  1.94e-01 
## [1] "PP abf for shared variant: 19.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.18e-06  2.14e-07  8.51e-01  8.32e-02  6.54e-02 
## [1] "PP abf for shared variant: 6.54%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.15e-06  2.68e-07  8.38e-01  1.05e-01  5.73e-02 
## [1] "PP abf for shared variant: 5.73%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.39e-06  3.17e-07  5.42e-01  1.23e-01  3.35e-01 
## [1] "PP abf for shared variant: 33.5%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.19e-06  2.33e-07  8.52e-01  9.09e-02  5.68e-02 
## [1] "PP abf for shared variant: 5.68%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.93e-09  2.48e-06  1.14e-03  9.65e-01  3.35e-02 
## [1] "PP abf for shared variant: 3.35%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.38e-08  2.50e-08  9.30e-03  8.75e-03  9.82e-01 
## [1] "PP abf for shared variant: 98.2%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.12e-06  1.81e-07  8.28e-01  7.06e-02  1.02e-01 
## [1] "PP abf for shared variant: 10.2%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.17e-06  1.96e-07  8.47e-01  7.63e-02  7.67e-02 
## [1] "PP abf for shared variant: 7.67%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.21e-06  1.88e-07  8.60e-01  7.31e-02  6.72e-02 
## [1] "PP abf for shared variant: 6.72%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.84e-06  4.65e-07  7.16e-01  1.81e-01  1.03e-01 
## [1] "PP abf for shared variant: 10.3%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.09e-06  3.33e-07  8.13e-01  1.30e-01  5.69e-02 
## [1] "PP abf for shared variant: 5.69%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.88e-06  1.81e-07  7.32e-01  7.03e-02  1.98e-01 
## [1] "PP abf for shared variant: 19.8%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.55e-16  2.56e-06  9.93e-11  1.00e+00  9.97e-09 
## [1] "PP abf for shared variant: 9.97e-07%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.22e-06  1.80e-07  8.66e-01  7.01e-02  6.40e-02 
## [1] "PP abf for shared variant: 6.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  4.39e-07  2.09e-06  1.71e-01  8.17e-01  1.21e-02 
## [1] "PP abf for shared variant: 1.21%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  1.52e-06  3.64e-07  5.95e-01  1.42e-01  2.64e-01 
## [1] "PP abf for shared variant: 26.4%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.15e-06  1.88e-07  8.39e-01  7.32e-02  8.78e-02 
## [1] "PP abf for shared variant: 8.78%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  2.19e-06  1.99e-07  8.53e-01  7.75e-02  6.93e-02 
## [1] "PP abf for shared variant: 6.93%"
## [1] 1
## [1] 1
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf 
##  5.34e-34  2.56e-06  2.08e-28  1.00e+00  9.96e-09 
## [1] "PP abf for shared variant: 9.96e-07%"

Looking at the coloc output

# take a look at the results, format them, and then look at the significant ones
gene_coloc_results 
colnames(gene_coloc_results) = c("tissue", "gene", "nsnps", "PP.H0.abf", "PP.H1.abf", "PP.H2.abf", "PP.H3.abf", "PP.H4.abf")
subset(gene_coloc_results, PP.H4.abf > 0.50)

Plotting interesting coloc results using RACER package

For interesting coloc results, we can produce a mirror plot, mapping the GWAS summary statistics and the eQTL summary statistics onto the same genome coordinates for comparison

# Choosing a tissue with an interesting coloc result
tissue = "Artery_Coronary"
z = read_tsv(url(paste0("https://github.com/Farber-Lab/STAR_protocols_core_modules/raw/main/data/eqtl_data/b4galnt3_snps/", tissue, "_b4galnt3_ebmd_snps.txt")), col_names = FALSE)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_character(),
##   X2 = col_double(),
##   X3 = col_double(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_double(),
##   X11 = col_double(),
##   X12 = col_double(),
##   X13 = col_double()
## )
cols <- c("X2", "X3", "X4", "X5", "X6")
z$variant_id <- do.call(paste, c(z[cols], sep="_"))
z = z[,c(1,14,7,8,9,10,11,12,13)]
colnames(z) = c("gene_id", "variant_id", "tss_distance", "ma_samples", "ma_count", "maf",
                     "pval_nominal", "slope", "slope_se")
gene_snp_ids = GTExIdConvert(z$variant_id)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_character(),
##   X2 = col_character()
## )
## [1] 1
## [1] 1
z = merge(z, gene_snp_ids, by = "variant_id")
  
eqtl_b4galnt3 = separate(z, variant_id, into = c("chr", "pos", "ref", "alt", "build"), sep = "_")

# Filtering GWAS data for plot
gwas_b4galnt3 = gwas_sumstat %>%
  filter(CHR == as.numeric(eqtl_b4galnt3$chr[1])) %>%
  filter(BP > min(eqtl_b4galnt3$pos)) %>%
  filter(BP < max(eqtl_b4galnt3$pos))

# Using RACER package to generate mirror plot
gwas_b4galnt3_formatted = formatRACER(assoc_data = gwas_b4galnt3, chr_col = "CHR", pos_col = "BP", p_col = "P", rs_col = "SNP")
## Formating association data...
## Processing -log10(p-values)...
## Preparing association data...
eqtl_b4galnt3_formatter = formatRACER(assoc_data = eqtl_b4galnt3, chr_col = "chr", pos_col = "pos", p_col = "pval_nominal", rs_col = "rs_id")
## Formating association data...
## Processing -log10(p-values)...
## Preparing association data...
gwas_b4galnt3_ld = ldRACER(assoc_data = gwas_b4galnt3_formatted, rs_col = "RS_ID", pops = "EUR", lead_snp = "rs215226")
## All inputs are go!
## Reading in association data...
## Populations selected.
## Calculating LD using rs215226...
## [1] "https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs215226&pop=EUR&r2_d=r2&token=c0f613f149ab"
## Merging input association data with LD...
eqtl_b4galnt3_ld = ldRACER(assoc_data = eqtl_b4galnt3_formatter, rs_col = "RS_ID", pops = "EUR", lead_snp = "rs215226")
## All inputs are go!
## Reading in association data...
## Populations selected.
## Calculating LD using rs215226...
## [1] "https://ldlink.nci.nih.gov/LDlinkRest/ldproxy?var=rs215226&pop=EUR&r2_d=r2&token=c0f613f149ab"
## Merging input association data with LD...
mirrorPlotRACER(assoc_data1 = eqtl_b4galnt3_ld, assoc_data2 = gwas_b4galnt3_ld, chr = 12, name1 = "B4GALNT3 eQTL", name2 = "BMD GWAS", plotby = "coord", start_plot = 500000, end_plot = 700000, label_lead = TRUE)
## Plotting by...
## coord
## Reading in association data
## Generating plot.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Warning: Removed 1 row(s) containing missing values (geom_path).

(7) PhenStat Analysis

While the colocalization analysis provides evidence supporting a relationship between network identified genes and a trait of interest, a causal relationship can only be demonstrated through controlled perturbation of a target and direct measurement of the phenotype of interest. While the hypotheses here can lead to a novel set of experiments, databases of experimental perturbations and measured phenotypes can be mined for evidence supporting a causal relationship between a gene and a phenotype of interest. For example, the International Mouse Phenotyping Consortium has a database of phenotypes measured in 7022 strains of knockout mice (IMPC release 12.0). - search for gene on main page https://www.mousephenotype.org/ - go to gene page https://www.mousephenotype.org/data/genes/MGI:3041155 - click all data table, search for relevant phenotype - click into a relevant phenotype https://www.mousephenotype.org/data/charts?accession=MGI:3041155&allele_accession_id=MGI:4434237&pipeline_stable_id=ESLIM_001&procedure_stable_id=ESLIM_005_001&parameter_stable_id=ESLIM_005_001_004&zygosity=homozygote&phenotyping_center=ICS - Right click and copy the link to download the “PhenStat-ready raw experiment data” - Copy the url into the url defition in the code chunk below

# Download the file
url = 'https://www.mousephenotype.org/data/exportraw?phenotyping_center=ICS&parameter_stable_id=ESLIM_005_001_004&allele_accession_id=MGI:4434237&strain=MGI:2164831&pipeline_stable_id=ESLIM_001&&zygosity=homozygote&'
dataset1 = data.table::fread(url)
# Change the date format, strip only the columns we need to do the statistical comparison
dataset1 = as.data.frame(dataset1)
dataset1 = dataset1[,c(15,16,21,26,28)]
dataset1$Assay.Date = as.character(dataset1$Assay.Date)
dataset1

Next, we create a test object for PhenStat that will go into the statistical test

test<-PhenList(dataset1,
  testGenotype="EPD0140_5_G02",
  dataset.colname.batch = "Assay.Date",
  dataset.values.male="male",
  dataset.values.female="female",
  dataset.clean=TRUE, 
  outputMessages = TRUE)
## Information:
## Dataset's 'Genotype' column has following values: '+/+', 'EPD0140_5_G02'
## Information:
## Dataset's 'Sex' column has following value(s): 'Female', 'Male'
test@datasetPL
PhenStat:::getStat(test)
##   Variables Numeric Continuous Levels NObs        Mean      StdDev     Minimum
## 1     Batch   FALSE      FALSE    104  920          NA          NA          NA
## 2  Genotype   FALSE      FALSE      2  920          NA          NA          NA
## 3       Sex   FALSE      FALSE      2  920          NA          NA          NA
## 4     Value    TRUE       TRUE    909  920  0.04464975 0.003652078  0.03730429
## 5    Weight    TRUE       TRUE    130  920 24.60663043 2.951539971 17.50000000
##       Maximum
## 1          NA
## 2          NA
## 3          NA
## 4  0.06429247
## 5 32.70000000

Next, use testDataset to test for differences in a dependant variable, here "Value", which is the BMD measurement. The program will choose whether to keep specific model effect, and in this case it corrects for batch, weight, and sex, but not an interaction term. Additionally, it does not detect a difference variances.

results_MM_bmd = testDataset(test, depVariable = "Value")
## Information:
## Dependent variable: 'Value'.
## Information:
## Perform all MM framework stages: startModel and finalModel.
## Information:
## Method: Mixed Model framework.
## Information:
## Equation: 'withWeight'.
## Information:
## Calculated values for model effects are: keepBatch=TRUE, keepEqualVariance=TRUE, keepWeight=TRUE, keepSex=TRUE, keepInteraction=FALSE.

Use summary output to view the results of the statistical test for comparing BMD. You can see that there is an effect of genotype, there is no sexual dimorphism, and the effect of weight in the model was significant.

summaryOutput(results_MM_bmd)
## 
## Test for dependent variable:
## *** Value ***
## 
## Method:
## *** Mixed Model framework ***
## ----------------------------------------------------------------------------
## Model Output
## ----------------------------------------------------------------------------
## Final fitted model: Value ~ Genotype + Sex + Weight
## Was batch significant? TRUE
## Was variance equal? TRUE
## Genotype p-value: 5.241566e-02
## Genotype effect: -1.957718e-03 +/- 1.008667e-03
## Was there evidence of sexual dimorphism? no (p-value 8.495508e-01)
## Genotype percentage change Female: -4.38%
## Genotype percentage change Male: -4.38%
## 
## ----------------------------------------------------------------------------
## Classification Tag
## ----------------------------------------------------------------------------
## With phenotype threshold value 0.01 - no significant change
## 
## ----------------------------------------------------------------------------
## Model Output Summary
## ----------------------------------------------------------------------------
##                               Value    Std.Error  DF   t-value      p-value
## (Intercept)            0.0239276920 1.301252e-03 813 18.388211 2.074091e-63
## GenotypeEPD0140_5_G02 -0.0019577183 1.008667e-03 813 -1.940897 5.261613e-02
## SexMale               -0.0010626637 3.414265e-04 813 -3.112423 1.920506e-03
## Weight                 0.0008632046 5.786314e-05 813 14.918039 1.158340e-44

Finally, we can create a boxplot of the differences in the value between genotypes for both sexes.

boxplotSexGenotype(test,
                   depVariable = "Value",
                   graphingName = "Bone Mineral Density",
                   outputMessages = T)

PhenStat:::boxplotSexGenotypeBatchAdjusted(test, depVariable = "Value")
## Information:
## Value variable is adjusted treating batch as a random effect.